Climate change is a global challenge jeopardizing humanity’s future. The purpose of this project is to investigate the relationship between emissions of CO2, a greenhouse gas that is one of the largest drivers of climate change, and the use of renewable energy such as solar, wind, or hydroelectric energy. The data used in this project come from two different sources: one measuring total carbon dioxide emissions per country over the years, and one measuring the proportion of each country's energy use that is renewable over the years. Total_CO2_Emissions is in 1000 tons of CO2, and Percent_Consumption_Renewable is the percentage of total energy use produced by renewable sources per country over the years. Because the data is collected on individual countries over the years, this investigation primarily focuses on the Average Emissions and the Average Percent Renewable across all countries in a specific year. The data from which the averages are computed covers 207 distinct countries observed over 29 years. The relationship between Average Emissions and Average Percent Renewable is hypothesized to be negative: if humanity used more renewable energy on average, average CO2 emissions would be lower. Through this investigation, we hope to learn more about climate change and potential solutions to this global problem affecting every organism on planet Earth.
2 Materials and Methods
The master data used for analysis in this project contains the yearly averages across all countries from two different data sets: one measuring total carbon dioxide emissions per country over the years, and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from Gapminder, an online database that has been a reliable provider of data since 2005 and aims to promote global sustainability through easily accessible information.
Before analysis, the data had to be properly cleaned and wrangled. The CO2 data set contained values in the form of 25.3k and 4.9M, where the suffixes signal thousands and millions. The first step in the data cleaning process was to replace these suffixes with their numeric equivalents to create a numeric variable on which calculations could be performed. Once the data was all of the same type, it was pivoted into tidy form and the years of interest (1989 – 2017) were selected. Once pivoted, the data sets were joined by year and country to produce a clean master data set used for model fitting, plotting, and analysis. A final decision was made to drop all the NAs in the joined data for a few reasons. The missing values typically occurred in the renewable-percentage variable, in the earlier years measured, and usually in countries with smaller populations. It is unclear whether these NAs reflect a lack of observation or countries that simply had no renewable energy production or consumption. Because averages are being studied and the NAs usually occurred in smaller countries, there was still a large enough sample to average over once those observations were dropped. Imputation methods such as cell-mean imputation were considered but not used, for fear of introducing bias into the data.
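The suffix conversion described above can be sketched in a few lines of base R. This is a minimal illustration rather than the project's exact cleaning code: the helper name `suffix_to_numeric` and the sample values are made up for the example.

```r
# Hypothetical helper: turn strings such as "25.3k" or "4.9M" into numbers
# by rewriting the suffix as a power-of-ten exponent that as.numeric() accepts.
suffix_to_numeric <- function(x) {
  x <- gsub("k", "e3", x, fixed = TRUE)  # thousands: "25.3k" becomes "25.3e3"
  x <- gsub("M", "e6", x, fixed = TRUE)  # millions:  "4.9M"  becomes "4.9e6"
  as.numeric(x)
}

# Made-up sample values, including a missing entry
raw_values <- c("25.3k", "4.9M", "812", NA)
clean_values <- suffix_to_numeric(raw_values)
print(clean_values)
```

Values without a suffix pass through unchanged and missing entries stay `NA`, so dropping or keeping them remains a separate decision.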
Linear regression, a method that predicts the values of one variable from another by fitting the straight line that minimizes the sum of squared residuals, was used to fit the model and make predictions. All data cleaning, modeling, and subsequent analysis were conducted in R.
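To make the least-squares idea concrete, the sketch below computes the slope and intercept from their closed-form expressions and compares them with `lm()`. The five data points are invented for illustration and are not taken from the project's data.

```r
# Made-up toy data (not the Gapminder averages)
x <- c(10, 12, 14, 16, 18)
y <- c(50, 44, 41, 33, 30)

# Closed-form least-squares estimates for a single explanatory variable
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# lm() minimizes the same sum of squared residuals
fit <- lm(y ~ x)

print(c(intercept = intercept, slope = slope))
print(coef(fit))
```

Both print statements should show the same two numbers, since `lm()` solves the same minimization problem.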
Code
datatable((head(master, n =50)),caption ='Interactive Preview of Data Set')
Regression Equation:
\[ \hat{y} = 693309.6 - 17770.2 \times \text{Average Percent Renewable} \]
Above is the simple linear regression equation based on the model predicting the response variable, Average Emissions (\(\hat{y}\)), by the explanatory variable, Average Percent Renewable. The coefficient on Average Percent Renewable is negative at -17770.2, meaning that for each one percentage point increase in Average Percent Renewable, the predicted average amount of CO2 emissions decreases by 17770.2 thousand tons. The intercept of 693309.6 means that when Average Percent Renewable is zero, so that no renewable energy is being used at all on average, the predicted Average Emissions would be 693309.6 thousand tons.
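As a quick check of this interpretation, the reported coefficients can be plugged into a small function. The function name `predict_emissions` and the input percentages are illustrative assumptions; only the two coefficients come from the text.

```r
# Coefficients as reported in the text
intercept <- 693309.6
slope <- -17770.2

# Hypothetical helper: predicted Average Emissions (1000 tons)
# for a given Average Percent Renewable
predict_emissions <- function(pct_renewable) {
  intercept + slope * pct_renewable
}

# At zero percent renewable the prediction equals the intercept
print(predict_emissions(0))

# A one percentage point increase changes the prediction by the slope
print(predict_emissions(20) - predict_emissions(19))
```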
Code
# Plot 1
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Average Renewable Energy Consumption (%)",
       y = "",
       title = "Relationship between Renewable Energy Usage and CO2 Emissions",
       subtitle = "Average CO2 Emissions (1000 tonnes)") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        plot.subtitle = element_text(size = 10),
        axis.title.x = element_text(size = 10))
raw_graph
The graph above shows the relationship between Average Emissions and Average Percent Renewable. The scatterplot illustrates a negative linear relationship, with the points lying relatively close to the plotted regression line and showing little deviation or noise. There are few to no unusual observations. This is consistent with the hypothesis that as Average Percent Renewable increases, Average Emissions decreases.
Code
co2_by_year_graph <- plot_ly(master,
                             x = ~Year,
                             y = ~`Average Emissions`,
                             type = 'scatter',
                             marker = list(color = 'red'))
co2_by_year_graph <- co2_by_year_graph |>
  layout(title = 'Average Emissions Over Time',
         yaxis = list(title = 'Average Emissions (1000 tons)',
                      titlefont = list(size = 14)),
         # xaxis is its own argument to layout(), not an element of yaxis
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))
energy_by_year_graph <- plot_ly(master,
                                x = ~Year,
                                y = ~`Average Percent Renewable`,
                                marker = list(color = 'green'),
                                type = 'scatter')
energy_by_year_graph <- energy_by_year_graph |>
  layout(title = 'Average Percent Renewable Over Time',
         yaxis = list(title = 'Average Percent Renewable',
                      titlefont = list(size = 14)),
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))
co2_by_year_graph
Code
energy_by_year_graph
As shown by the two plots of Average Emissions and Average Percent Renewable over time, the two variables are negatively related, but perhaps not in the way expected. Average Percent Renewable decreases over the years while Average Emissions increases, which still illustrates a negative relationship. It makes sense that Average Emissions increases over time because of rapid population growth and growing demand for production, but Average Percent Renewable has, surprisingly, been declining in recent years. While this likely has something to do with the varying definition of renewable energy, for example whether or not nuclear energy is truly renewable, it is surprising that renewable energy use does not grow as technology develops. This suggests that humanity needs to increase renewable energy production and usage on average in order to reduce its carbon footprint and preserve nature for future generations.
Model Fit:
Code
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit',
        digits = 3,
        format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
Model Fit

| Variance of Fitted | Variance of Residuals | Variance of Average CO2 Emissions |
|---|---|---|
| 415,649,404 | 46,532,518 | 462,181,923 |
The proportion of variability in the response values accounted for by the model, \(R^{2}\), was very large at about 89.93 percent; it is the variance of the fitted values divided by the variance of Average Emissions. This suggests a good quality model, in which about 90% of the variation in the response, Average Emissions, is explained by the explanatory variable, Average Percent Renewable. Only about 10% of the variation in average emissions is left unexplained by the linear model.
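The \(R^{2}\) value can be reproduced directly from the three variances reported in the Model Fit table. The sketch below only re-does that arithmetic; the variable names are chosen for the example.

```r
# Variances as reported in the Model Fit table
var_fitted   <- 415649404
var_resid    <- 46532518
var_response <- 462181923

# R squared is the share of the response variance captured by the fitted values
r_squared <- var_fitted / var_response
print(round(r_squared, 4))

# For a least-squares fit with an intercept, fitted and residual variance
# add up to the response variance (up to rounding in the reported values)
print(var_fitted + var_resid)
```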
Code
energy_emissions_model |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(y = '',
       subtitle = 'Residuals',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
Simulation:
Simulation is a critical technology for developing planning and exploratory models to optimize decision making (de Paula Ferreira et al., 2020).
In this part, we perform a basic linear model simulation to see how well the model performs under preset conditions, namely adding normally distributed error to the values on the fitted regression line.
The basic procedure in this study is:
1. Fit a linear regression model to the observed data (already done in the previous part).
2. Assume the model is right, and add generated error to the values predicted by the linear regression model (we generated normally distributed error for this study).
3. Obtain the simulated data and compare it with the observed data (through histograms of the values, scatterplots of the modeled and observed relationships, and a y = x plot).
4. Check and interpret the simulated \(R^2\) value.
5. Iterate to generate many simulated data sets.
6. Check, interpret, and plot the simulated \(R^2\) values for the simulated data sets.
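Steps 1 through 4 of the procedure above can be sketched in base R as follows. The data frame `toy`, its coefficients, and the random seed are all made up for illustration; they stand in for the yearly averages and are not the project's data or results.

```r
set.seed(1)  # assumed seed, only so that this sketch is reproducible

# Made-up stand-in for the yearly averages (29 rows, one per year studied)
toy <- data.frame(pct_renewable = seq(16, 23, length.out = 29))
toy$emissions <- 700000 - 18000 * toy$pct_renewable + rnorm(29, sd = 7000)

# Step 1: fit the linear regression model to the "observed" data
fit <- lm(emissions ~ pct_renewable, data = toy)

# Step 2: assume the model is right and add normally distributed error,
# with standard deviation equal to the residual standard error, to the fitted values
toy$sim_emissions <- predict(fit) + rnorm(nrow(toy), mean = 0, sd = sigma(fit))

# Steps 3 and 4: regress observed on simulated values and read off R squared
sim_r2 <- summary(lm(emissions ~ sim_emissions, data = toy))$r.squared
print(sim_r2)
```

Using the residual standard error from `sigma()` as the noise level means the simulated points scatter around the line about as much as the observed points do.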
Simulation for a single data set:
Code
# Add normally distributed noise (given mean and sd) to each element of x
noise <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean, sd)
}
Code
# Fitted values and residual standard error of the model
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)
# Simulated response: fitted values plus normal noise
sim_response <- tibble(sim_emissions = noise(master_predict, sd = master_sigma))
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
The histograms show the differences between the observed and simulated data. At first glance the two graphs do not look very similar, but on closer inspection the ranges where the values are concentrated are similar.
The left graph shows the distribution of the observed yearly average emissions. Most of the data fall within \(1.1\times10^6\) tonnes to \(1.2\times10^6\) tonnes and \(1.6\times10^6\) tonnes to \(1.7\times10^6\) tonnes. The simulated yearly emissions are likewise concentrated in the ranges of \(1.1\times10^6\) tonnes to \(1.2\times10^6\) tonnes and \(1.5\times10^6\) tonnes to \(1.7\times10^6\) tonnes.
Code
sim_data <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`, `Average Percent Renewable`) |>
  bind_cols(sim_response)
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`, y = `Average Emissions`)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Observed Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
sim_master_graph <- sim_data |>
  ggplot(aes(x = `Average Percent Renewable`, y = sim_emissions)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Simulated Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Simulated Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
# Placing two ggplots side by side with `+` requires the patchwork package to be loaded
raw_graph + sim_master_graph
In these two graphs, we plotted the observed and simulated relationships between average renewable energy consumption and average CO2 emissions across countries. The simulated data in the right graph lies closer to a straight line, which is the fitted linear regression line.
Code
# Check the similarity between the simulated data and observed data
sim_data |>
  ggplot(aes(x = sim_emissions, y = `Average Emissions`)) +
  geom_point() +
  labs(x = "Simulated Avg CO2 Emissions/1000 tonnes",
       y = "",
       title = "The Similarity between Observed Data and Simulated Data",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  # Reference line y = x: points on it have identical observed and simulated values
  geom_abline(slope = 1,
              intercept = 0,
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
In this graph, we check the similarity between the observed data and the simulated data. Ideally, the points would fall on the blue dashed line, which is the y = x line. Points above the line are cases where the observed value is larger than the simulated value, so the simulation underestimates the observation; points below the line are cases where the simulation overestimates it. In this run, more points fall below the line. Overall, the points lie close to the y = x line, which indicates a strong relationship between the observed and simulated values.
p value
Code
# Check how well the simulated data fit the observed data, especially the p-value and R squared.
sim_fit <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance()
sim_fit
# Extract the p-value of the fit
p_value <- sim_fit |>
  pull(p.value)
p_value
# Extract the R squared of the observed data regressed on the simulated data
sim_r2 <- sim_fit |>
  pull(r.squared)
sim_r2
The p-value is extremely small (p < 0.05), which indicates that the relationship between the simulated and observed emissions is statistically significant.
\(R^2\) is 0.803297, which means the simulated data explain around 80% of the variation in the observed data. Because no random seed is set, this value changes from run to run. On this basis, the single simulation did a fairly good job.
Because the normally distributed error is added at random, a single simulation gives only one possible outcome. Repeating the simulation many times gives a more stable picture: we obtain the distribution of \(R^2\) values the model can produce rather than a single draw from it.
In the following steps, we generated 1000 simulated data sets and combined them with the observed data into one full data set. We then computed the \(R^2\) for each of the 1000 simulated data sets and plotted their distribution.
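The repeated simulation can also be sketched with base R's `replicate()`. As before, the data, coefficients, and seed are invented for illustration, so the resulting \(R^2\) values are not the project's results.

```r
set.seed(2)  # assumed seed, only so that this sketch is reproducible

# Made-up stand-in for the yearly averages
x <- seq(16, 23, length.out = 29)
y <- 700000 - 18000 * x + rnorm(29, sd = 7000)
fit <- lm(y ~ x)

# Repeat the single-simulation step many times, keeping one R squared per run
n_sims <- 1000
sim_r_sq <- replicate(n_sims, {
  sim_y <- predict(fit) + rnorm(length(y), mean = 0, sd = sigma(fit))
  summary(lm(y ~ sim_y))$r.squared
})

# Summarize the distribution of simulated R squared values
print(summary(sim_r_sq))
```

Passing `sim_r_sq` to `hist()` would give the base R counterpart of the histogram drawn later in this section.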
Code
# Create 1000 simulated data sets
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, sd = master_sigma)))
# Clean the column names (sim...1 becomes sim_1)
colnames(sims) <- colnames(sims) |>
  str_replace(pattern = "\\.\\.\\.", replacement = "_")
# Bind the 1000 simulated data sets and the observed data set together
sims <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`) |>
  bind_cols(sims)
head(sims)
Code
# Map over the columns to get the R squared of each simulated data set
sim_r_sq <- sims |>
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |>
  map(glance) |>
  map_dbl(~ .x$r.squared)
head(sim_r_sq)
Average Emissions     sim_1     sim_2     sim_3     sim_4     sim_5
        1.0000000 0.8030574 0.8284892 0.8438176 0.8151787 0.7634048
Code
# The first entry regresses the observed data on itself (R squared = 1),
# so drop it before plotting the simulated values.
sim_r_sq <- sim_r_sq[names(sim_r_sq) != "Average Emissions"]
# The distribution of these values tells us whether the assumed model does a good job
# of producing data similar to what was observed. If it does, we would expect values near 1.
# Here the simulated R squared values are around 0.8.
tibble(sims = sim_r_sq) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated" ~ R^2),
       y = "",
       subtitle = "Number of Simulated Models",
       title = "1000 Simulated R-Squared Distribution") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
The histogram of the 1000 simulated \(R^2\) values shows that they occur most frequently around 0.8, which agrees with the \(R^2\) from the single simulated data set.
1. Gapminder. https://www.gapminder.org/data/
2. de Paula Ferreira, W., Armellini, F., & De Santa-Eulalia, L. A. (2020). Simulation in industry 4.0: A state-of-the-art review. Computers & Industrial Engineering, 149, 106868. https://doi.org/10.1016/J.CIE.2020.106868
Source Code
<<<<<<< HEAD
---title: "Investigating CO2 Emissions and Renewable Energy Usage"author: "Noel Lopez, Patrick George, Riley Svensson, Ningjing Hua"format: html: self-contained: true code-tools: true toc: true number-sections: true code-fold: trueeditor: sourceexecute: error: true echo: true message: false warning: false---```{r setup}library(tidyverse)library(broom)library(knitr)library(DT)library(kableExtra)library(gridExtra)library(patchwork)options(scipen =99999)energy <- readxl::read_xlsx(here::here("renewable_energy.xlsx"))co2 <- readxl::read_xlsx(here::here("co2.xlsx"))``````{r clean data}#| output: falseconvert_to_numeric <-function(x) { x <-str_replace_all(x, "k", "e3") x <-str_replace_all(x, "M", "e6")return(as.numeric(x))}co2_clean <- co2 |>select(country, `1989`:`2017`) |>mutate(across(.cols =`1989`:`2017`, ~convert_to_numeric(.))) |>pivot_longer(cols =!country,names_to ="year",values_to ="Total_CO2_Emissions")energy_clean <- energy |>pivot_longer(cols =!country,names_to ="year",values_to ="Percent_Consumption_Renewable")joined <-inner_join(co2_clean, energy_clean) |>drop_na()joined |>distinct(year) |>count()joined |>distinct(country) |>count()master <- joined|>rename(Year = year) |>group_by(Year) |>summarize(`Average Emissions`=mean(`Total_CO2_Emissions`),`Average Percent Renewawable`=mean(`Percent_Consumption_Renewable`))```# IntroductionClimate change is a global challenge jeopardizing humanity's future. This purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change creating greenhouse gasses, and the use of renewable energy such as solar, wind, or hydroelectric energy. The data to be used in this project incorporates data from two different sources, one measuring total carbon dioxide emissions per country over the years and one measuring proportion of energy use which is renewable over the years. 
`Total_CO2_Emission` is in 1000 tons of CO2 and `Percent_Consumption_Renewables` is a percentage of total energy use produced by renewable sources per country over the years. Because the data is collected on individual countries over the years, this investigation will primarily focus on the `Average Emissions` as well as the `Average Percent Renewable` of all the countries in a specific year. In the data of which the averages will be investigated, there are 207 distinct countries observed over 29 years. The relationship between `Average Emissions`and `Average Percent Renewable` is hypothesized to be negative whereas humanity used more renewable energy on average there would be less average CO2 emissions. Through this investigation hopefully more information can be learned about climate change and potential solutions to this global problem affecting every organism on planet Earth.# Materials and MethodsThe master data to be used for analysis in this project incorporates the averages for all countries over the years from twodifferent data sets, one measuring the total carbon dioxide emissions, per country over the years and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from an online database titled *Gapminder* which has been a reliable provider of data since 2005, in hopes of promoting global sustainability through easily accessible information.Before analysis, the data had to be properly cleaned and wrangled. The CO2 data set contained values in the form of 25.3k and 4.9M, signaling the units of thousands and millions. The first step in the data cleaning process was to replace these characters with numbers to create a numeric variable which calculations could be performed on. Once the data was of all the same type, the data was then pivoted to be in tidy form and the years of interest (1989 -- 2017) were selected. 
Once pivoted, the data sets were joined by year and country to produce a clean master data set which shall be used for model fitting, plotting, and analysis. A final decision was made to drop all the Na's in the master set due to a few reasons. The missing values typically occurred in the `Average Percent Renewable` variable across the earlier years measured and usually in countries with smaller populations. Whether these Na's were included in the data due to lack of observation or for simply not having any renewable energy production or consumption it is unclear. Because the averages are being studied and the Na's usually occured in smaller countries, there was still a large enough sample size to average over once those observations were dropped. Other forms of imputation were contemplated such as cell-mean imputation but not conducted due to fears of introducing bias to the data.Linear regression, a method involving predicting the values of one variable, based on another, through producing a straight line minimizing the value for the sum of squared residuals, was used to create and predict a model. All the data cleaning, models, and subsequent analysis was conducted using R code.\n \n \n```{r}datatable((head(master, n =50)),caption ='Interactive Preview of Data Set')```\n# Analysis and Discussion of Model```{r}#| output: falseenergy_emissions_model <-lm(`Average Emissions`~`Average Percent Renewawable`,data = master)summary(energy_emissions_model)tidy(energy_emissions_model)```**Regression Equation:**$$ \hat{y} = 693309.6 - 17770.2 * Average\,Percent\,Renewable $$Above is the simple linear regression equation based on the model predicting the response variable, `Average Emissions` ($\hat{y}$), by the explanatory variable, `Average Percent Renewable`. 
The coefficient on `Average Percent Renewable` is extremely negative sitting at -17770.2 meaning that for each one percent increase in the `Average Percent Renewable` energy used the average amount of CO2 emissions in thousands of tons decreases by 17770.2. The slope coefficient of 693309.6 means that when the `Average Percent Renewable` is zero such that there is no renewable energy being used at all on average, the predicted `Average Emissions` would be 693309.6 thousand tons.\n```{r}#| fig-align: center# Plot 1raw_graph <- master |>ggplot(aes(x =`Average Percent Renewawable`, y =`Average Emissions`) ) +geom_point() +geom_smooth(method ="lm") +theme(legend.position ="none") +labs(x =" Average Renewable Energy Consumption (%)", y ="", title ="Relationship between Renewable Energy Usage and CO2 Emissions", subtitle ="Average CO2 Emissions (1000 tonnes)") +theme(plot.title =element_text(hjust =0.5, face ='bold'),plot.subtitle =element_text(size =10),axis.title.x =element_text(size =10))raw_graph```The graph above demonstrates the relationship between `Average Emissions` and `Average Percent Renewable`. The distribution illustrates a negative linear relationship, where the points are relatively close to the plotted regression line with little deviation and noise. There are little to no unusual observations. 
This illustration is consistent with the hypothesis that as the `Average Percent Renewable` increases the `Average Emissions` decreases at a significant rate.```{r}#| fig-align: centerco2_by_year_graph <- master |>ggplot(aes(x = Year , y =`Average Emissions`)) +geom_point() +scale_x_discrete(guide =guide_axis(n.dodge=2)) +labs(x ="Year", y ="", title ="Average CO2 Emissions Over Time", subtitle ="Average CO2 Emissions (1000 tonnes)") +theme(plot.title =element_text(hjust =0.5, face ='bold'))energy_by_year_graph <- master |>ggplot(aes(x = Year, y =`Average Percent Renewawable`)) +geom_point() +scale_x_discrete(guide =guide_axis(n.dodge=2)) +labs(x ="Year", y ="", title ="Average Percentage of Renewable Energy Over Time",subtitle ="Average Renewable Energy Consumption (%)") +theme(plot.title =element_text(hjust =0.5, face ='bold'))grid.arrange(co2_by_year_graph, energy_by_year_graph)```As shown by the two distributions of `Average Emissions` and `Average Percent Renewable` over time, the relationship follows a negative relationship, but perhaps not as expected. `Average Percent Renewable` is decreasing over the years while `Average Emissions` is increasing which still illustrates a negative relationship. As time goes on it makes sense as to why `Average Emissions` is increasing, because of extreme population growth and growing demand for production but `Average Percent Renewable` has shockingly been declining in recent years. While this likely has something to due to the varying definition of renewable energy, for example whether or not nuclear energy is truly renewable, it is surprising that as technology develops renewable energy use does not. 
---
title: "Investigating CO2 Emissions and Renewable Energy Usage"
author: "Noel Lopez, Patrick George, Riley Svensson, Ningjing Hua"
format:
  html:
    self-contained: true
    code-tools: true
    toc: true
    number-sections: true
    code-fold: true
editor: source
execute:
  error: true
  echo: true
  message: false
  warning: false
---

```{r setup}
library(plotly)
library(tidyverse)
library(broom)
library(knitr)
library(DT)
library(kableExtra)
library(gridExtra)
library(patchwork)
library(RColorBrewer)

options(scipen = 99999)

energy <- readxl::read_xlsx(here::here("renewable_energy.xlsx"))
co2 <- readxl::read_xlsx(here::here("co2.xlsx"))
```

```{r clean data}
#| output: false
# Convert strings such as "25.3k" and "4.9M" to numbers
# by rewriting the suffixes in scientific notation.
convert_to_numeric <- function(x) {
  x <- str_replace_all(x, "k", "e3")
  x <- str_replace_all(x, "M", "e6")
  return(as.numeric(x))
}

# Keep the years of interest, convert to numeric, and pivot to tidy form.
co2_clean <- co2 |>
  select(country, `1989`:`2017`) |>
  mutate(across(.cols = `1989`:`2017`, ~ convert_to_numeric(.))) |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Total_CO2_Emissions")

energy_clean <- energy |>
  pivot_longer(cols = !country,
               names_to = "year",
               values_to = "Percent_Consumption_Renewable")

# Join on the shared keys and drop rows with missing values.
joined <- inner_join(co2_clean, energy_clean, by = c("country", "year")) |>
  drop_na()

# Number of distinct years and countries left after cleaning.
joined |>
  distinct(year) |>
  count()

joined |>
  distinct(country) |>
  count()

# Yearly averages across countries. Year is converted before grouping
# because a grouping column cannot be modified inside mutate().
master <- joined |>
  rename(Year = year) |>
  mutate(Year = as.numeric(Year)) |>
  group_by(Year) |>
  summarize(`Average Emissions` = mean(`Total_CO2_Emissions`),
            `Average Percent Renewable` = mean(`Percent_Consumption_Renewable`))
```

# Introduction

Climate change is a global challenge jeopardizing humanity's future. The purpose of this project is to investigate the relationship between CO2 emissions, one of the largest drivers of climate change, and the use of renewable energy such as solar, wind, or hydroelectric power. The data used in this project come from two different sources, one measuring total carbon dioxide emissions per country over the years and one measuring the proportion of energy use that is renewable over the years.
`Total_CO2_Emissions` is measured in thousands of tons of CO2, and `Percent_Consumption_Renewable` is the percentage of total energy use produced by renewable sources, per country and year. Because the data are collected on individual countries over the years, this investigation focuses primarily on the `Average Emissions` and the `Average Percent Renewable` across all countries in a given year. The data from which the averages are computed cover 207 distinct countries observed over 29 years. The relationship between `Average Emissions` and `Average Percent Renewable` is hypothesized to be negative: if humanity used more renewable energy on average, average CO2 emissions would be lower. Through this investigation, we hope to learn more about climate change and potential solutions to this global problem affecting every organism on planet Earth.

# Materials and Methods

The master data set used for analysis contains the yearly averages across all countries from two different data sets, one measuring total carbon dioxide emissions per country over the years and the other measuring the proportion of energy use classified as renewable per country over the years. The data sets were sourced from *Gapminder*, an online database that has been a reliable provider of data since 2005, in hopes of promoting global sustainability through easily accessible information.

Before analysis, the data had to be cleaned and wrangled. The CO2 data set contained values such as 25.3k and 4.9M, where the suffixes denote thousands and millions. The first step in the cleaning process was to replace these characters with numbers to create a numeric variable on which calculations could be performed. Once the data were all of the same type, the years of interest (1989 -- 2017) were selected and the data were pivoted into tidy form.
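As a small illustration of the suffix replacement described above, the following base R sketch converts a few example strings. The helper name `suffix_to_numeric` and the input values are hypothetical, and the sketch assumes the only suffixes present are `k` and `M`.

```r
# Hypothetical helper: rewrite the "k" and "M" suffixes in scientific
# notation so that as.numeric() can parse the strings.
suffix_to_numeric <- function(x) {
  x <- gsub("k", "e3", x, fixed = TRUE)
  x <- gsub("M", "e6", x, fixed = TRUE)
  as.numeric(x)
}

# Example inputs chosen for illustration only
example_values <- c("25.3k", "4.9M", "812")
converted <- suffix_to_numeric(example_values)
print(converted)
```

Values without a suffix pass through unchanged, so a column mixing plain and suffixed entries ends up on one numeric scale.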
Once pivoted, the data sets were joined by year and country to produce a clean master data set used for model fitting, plotting, and analysis. A final decision was made to drop all the NAs in the master set, for a few reasons. The missing values typically occurred in the `Percent_Consumption_Renewable` variable in the earlier years measured, and usually in countries with smaller populations. It is unclear whether these NAs reflect a lack of observation or simply the absence of any renewable energy production or consumption. Because averages are being studied and the NAs usually occurred in smaller countries, there was still a large enough sample to average over once those observations were dropped. Other approaches such as cell-mean imputation were considered but not used, for fear of introducing bias into the data.

Linear regression, a method that predicts the values of one variable from another by fitting the straight line that minimizes the sum of squared residuals, was used to fit the model and make predictions. All data cleaning, modeling, and subsequent analysis were conducted in R.

```{r}
datatable(head(master, n = 50),
          caption = 'Interactive Preview of Data Set')
```

# Analysis and Discussion of Model

```{r}
#| output: false
# Simple linear regression of yearly average emissions on
# yearly average percent renewable.
energy_emissions_model <- lm(`Average Emissions` ~ `Average Percent Renewable`,
                             data = master)
summary(energy_emissions_model)
tidy(energy_emissions_model)
```

**Regression Equation:**

$$ \hat{y} = 693309.6 - 17770.2 \times \text{Average Percent Renewable} $$

Above is the simple linear regression equation from the model, which predicts the response variable, `Average Emissions` ($\hat{y}$), from the explanatory variable, `Average Percent Renewable`.
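To make the equation concrete, the sketch below plugs a few renewable percentages into it. The coefficients are copied from the regression equation in the text; the inputs 0, 20, and 21 are hypothetical values chosen only for illustration and are not taken from the data, and `predict_emissions` is a helper name introduced here.

```r
# Coefficients copied from the regression equation in the text
intercept <- 693309.6
slope <- -17770.2

# Hypothetical helper: predicted average emissions (thousand tons)
# for a given average percent renewable.
predict_emissions <- function(percent_renewable) {
  intercept + slope * percent_renewable
}

# Illustrative inputs only, not observed values
example_percent <- c(0, 20, 21)
predicted <- predict_emissions(example_percent)

# Change in the prediction for a one percentage point increase
one_point_change <- predicted[3] - predicted[2]

print(predicted)
print(one_point_change)
```

The first prediction equals the intercept, and the difference between the last two equals the coefficient on `Average Percent Renewable`, which is how the two numbers in the equation are read.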
The coefficient on `Average Percent Renewable` is strongly negative at -17770.2, meaning that for each one percentage point increase in `Average Percent Renewable`, the predicted average CO2 emissions decrease by 17770.2 thousand tons. The intercept of 693309.6 means that when `Average Percent Renewable` is zero, such that no renewable energy is used at all on average, the predicted `Average Emissions` would be 693309.6 thousand tons.

```{r}
#| fig-align: center
# Plot 1
raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Average Renewable Energy Consumption (%)",
       y = "",
       title = "Relationship between Renewable Energy Usage and CO2 Emissions",
       subtitle = "Average CO2 Emissions (1000 tonnes)") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'),
        plot.subtitle = element_text(size = 10),
        axis.title.x = element_text(size = 10))

raw_graph
```

The graph above shows the relationship between `Average Emissions` and `Average Percent Renewable`. The points follow a negative linear relationship and lie relatively close to the plotted regression line, with little deviation or noise. There are few to no unusual observations.
This pattern is consistent with the hypothesis that as `Average Percent Renewable` increases, `Average Emissions` decreases markedly.

```{r}
#| fig-align: center
co2_by_year_graph <- plot_ly(
  master,
  x = ~Year,
  y = ~`Average Emissions`,
  type = 'scatter',
  marker = list(color = 'red')
)

# xaxis is passed alongside yaxis rather than nested inside it
co2_by_year_graph <- co2_by_year_graph |>
  layout(title = 'Average Emissions Over Time',
         yaxis = list(title = 'Average Emissions (1000 tons)',
                      titlefont = list(size = 14)),
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))

energy_by_year_graph <- plot_ly(
  master,
  x = ~Year,
  y = ~`Average Percent Renewable`,
  marker = list(color = 'green'),
  type = 'scatter'
)

energy_by_year_graph <- energy_by_year_graph |>
  layout(title = 'Average Percent Renewable Over Time',
         yaxis = list(title = 'Average Percent Renewable',
                      titlefont = list(size = 14)),
         xaxis = list(title = 'Year',
                      titlefont = list(size = 14)))

co2_by_year_graph
energy_by_year_graph
```

As the two plots of `Average Emissions` and `Average Percent Renewable` over time show, the relationship is negative, but perhaps not in the way expected. `Average Percent Renewable` decreases over the years while `Average Emissions` increases, which still amounts to a negative relationship. The rise in `Average Emissions` over time makes sense given rapid population growth and growing demand for production, but `Average Percent Renewable` has, surprisingly, been declining in recent years. While this likely has something to do with varying definitions of renewable energy, for example whether or not nuclear energy counts as renewable, it is surprising that renewable energy use has not grown as technology has developed.
This suggests that humanity needs to increase renewable energy production and usage on average in order to reduce its carbon footprint and preserve nature for future generations.

**Model Fit:**

```{r}
# Compare the variance of the fitted values and residuals
# with the variance of the response.
energy_emissions_model |>
  augment() |>
  summarize(`Variance of Fitted` = var(.fitted),
            `Variance of Residuals` = var(.resid),
            `Variance of Average CO2 Emissions` = var(`Average Emissions`)) |>
  kable(caption = 'Model Fit',
        digits = 3,
        format.args = list(big.mark = ",")) |>
  kable_styling(bootstrap_options = c('striped', 'bordered'))
```

The proportion of variability in the response accounted for by the model, $R^{2}$, was very large at about 89.93 percent. This suggests a good quality model, in which nearly 90% of the variation in the response, `Average Emissions`, is explained by the explanatory variable, `Average Percent Renewable`. A high proportion of the variability in the response is therefore accounted for by the linear model, which suggests that few other large factors influence the yearly average emissions.

```{r}
#| fig-align: center
energy_emissions_model |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(y = '',
       subtitle = 'Residuals',
       x = 'Fitted Values',
       title = 'Relationship between Residual and Fitted Values') +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
```

**Simulation:**

Simulation is a critical technology for developing plans and exploring models to optimize decision making (de Paula Ferreira et al., 2020).

In this part, we perform a basic linear model simulation to see how well the model performs under preset conditions, namely adding normally distributed error to the linear regression line.

The basic procedure in this study is:

1. Fit a linear regression model to the observed data (done in the previous part).
2. Assume the model is correct and add generated error to the fitted values (normally distributed error in this study).
3. Obtain the simulated data and compare them with the observed data (using distribution graphs, scatterplots of the modeled and observed relationships, and a y = x plot).
4. Check and interpret the simulated $R^2$ value.
5. Iterate to generate many simulated data sets.
6. Check, interpret, and plot the $R^2$ values for the simulated data sets.

Simulation for a single data set:

```{r}
# Add normally distributed error to a vector of values
noise <- function(x, mean = 0, sd) {
  x + rnorm(length(x), mean, sd)
}
```

```{r}
# Fitted values and residual standard error from the model
master_predict <- predict(energy_emissions_model)
master_sigma <- sigma(energy_emissions_model)

# One simulated response: fitted values plus random error
sim_response <- tibble(sim_emissions = noise(master_predict, sd = master_sigma))

raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")
```

```{r}
obs_emissions <- master |>
  ggplot(aes(x = `Average Emissions`)) +
  geom_histogram(binwidth = 3000) +
  labs(x = "Observed Emissions",
       y = "",
       subtitle = "Count",
       title = "Worldwide Yearly Emission Counts") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))

sim_emissions_graph <- sim_response |>
  ggplot(aes(x = sim_emissions)) +
  geom_histogram(binwidth = 3500) +
  labs(x = "Simulated Emissions",
       y = "",
       subtitle = "Count",
       title = "Simulated Worldwide Yearly Emission Counts") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))

# Show the two histograms side by side (patchwork)
obs_emissions + sim_emissions_graph
```

The histograms show how the observed and simulated data differ. At first glance the two graphs do not look very similar, but on closer inspection the ranges where values concentrate are similar.

The left graph shows the distribution of the observed yearly average emissions worldwide.
We can see that most of the observed values fall within $1.1\times10^6$ to $1.2\times10^6$ tonnes and $1.6\times10^6$ to $1.7\times10^6$ tonnes. The simulated yearly emissions are likewise concentrated in the ranges $1.1\times10^6$ to $1.2\times10^6$ tonnes and $1.5\times10^6$ to $1.7\times10^6$ tonnes.

```{r}
# Combine the observed values with the simulated response
sim_data <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`, `Average Percent Renewable`) |>
  bind_cols(sim_response)

raw_graph <- master |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = `Average Emissions`)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Observed Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")

sim_master_graph <- sim_data |>
  ggplot(aes(x = `Average Percent Renewable`,
             y = sim_emissions)) +
  geom_point() +
  theme(legend.position = "none") +
  labs(x = "Avg Renewable energy consumption percentage",
       y = "",
       title = "Simulated Relationships between Avg Renewable Energy Consumption and \nAvg CO2 Emissions among different countries",
       subtitle = "Simulated Avg CO2 Emissions/1000 tonnes") +
  theme(plot.title.position = "plot")

# Observed on the left, simulated on the right
raw_graph + sim_master_graph
```

In these two graphs, we plotted the relationship between average renewable energy consumption and average CO2 emissions among different countries, for the observed data (left) and the simulated data (right).
We can see that the simulated data in the right graph lie closer to a straight line, namely the fitted linear regression line.

```{r}
# Check the similarity between the simulated data and observed data
sim_data |>
  ggplot(aes(x = sim_emissions,
             y = `Average Emissions`)) +
  geom_point() +
  labs(x = "Simulated Avg CO2 Emissions/1000 tonnes",
       y = "",
       title = "The Similarity between Observed Data and Simulated Data",
       subtitle = "Avg CO2 Emissions/1000 tonnes") +
  geom_abline(slope = 1,
              intercept = 0,
              color = "steelblue",
              linetype = "dashed",
              lwd = 1.5) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
```

This graph checks the similarity between the observed and the simulated data. Ideally, the points would fall on the blue dashed line, which is the y = x line. Because the observed values are on the vertical axis, points above the blue line are years in which the simulated value is lower than the observed value (the simulation underestimates), and points below the line are years in which the simulation overestimates. In this run, the observed data were more often underestimated. Generally speaking, the points are close to the y = x line, which indicates a strong relationship between the observed and simulated values.

p value

```{r}
# Check how well the simulated data fit the observed data,
# starting with the p value.
p_value <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(p.value) |>
  pull()

p_value
```

r square

```{r}
# Get the R squared for the fit of the observed data on the simulated data.
sim_r2 <- lm(`Average Emissions` ~ sim_emissions, data = sim_data) |>
  glance() |>
  select(r.squared) |>
  pull()

sim_r2
```

The p value is extremely small (p\<0.05), which indicates that the relationship between the simulated and observed values is statistically significant.

$R^2$ is 0.803297, which means the simulated data explain around 80% of the variation in the observed data.
In this case, the single simulation appears to have done a fairly good job.

Simulation for 1000 data sets:

```{r}
# Create 1000 simulated data sets, one column per simulation
nsims <- 1000

sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(master_predict, sd = master_sigma)))
```

Because the normally distributed error is added to the regression line at random, a single simulation can be unrepresentative. Repeating the simulation many times gives a more stable result, since it describes the whole distribution of simulated outcomes rather than a single sample from it.

In the following steps, we generated 1000 simulated data sets and combined them with the observed data into one data set. We then computed the $R^2$ for each of the 1000 simulated data sets and plotted their distribution.
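The repeat-and-collect idea can be sketched in base R on a small synthetic data set. Everything here is an assumption made for illustration: the toy data, the variable names, the seed, and the choice of 200 repetitions are not part of the analysis, and the loop uses `vapply()` rather than the `purrr` functions used in this report.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical toy data standing in for the yearly averages
toy_x <- seq(10, 20, length.out = 29)
toy_y <- 700 - 18 * toy_x + rnorm(length(toy_x), sd = 15)

# Fit the assumed model and keep its fitted values and residual SD
toy_model <- lm(toy_y ~ toy_x)
toy_fitted <- fitted(toy_model)
toy_sigma <- sigma(toy_model)

# For each repetition: simulate a response from the fitted model,
# regress the observed response on it, and store the R squared.
n_sims <- 200
toy_r_sq <- vapply(
  seq_len(n_sims),
  function(i) {
    simulated <- toy_fitted + rnorm(length(toy_fitted), mean = 0, sd = toy_sigma)
    summary(lm(toy_y ~ simulated))$r.squared
  },
  numeric(1)
)

# Five-number summary of the simulated R squared values
summary(toy_r_sq)
```

Looking at the whole vector of $R^2$ values, rather than one draw, shows how much a single simulation can vary by chance.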
```{r}
# Clean the column names: "sim...1" becomes "sim_1", and so on
colnames(sims) <- colnames(sims) |>
  str_replace(pattern = "\\.\\.\\.",
              replacement = "_")
```

Original Average Emission data with 1000 simulated datasets

```{r}
# Bind the 1000 simulated data sets and the observed data set together
sims <- master |>
  filter(!is.na(`Average Emissions`),
         !is.na(`Average Percent Renewable`)) |>
  select(`Average Emissions`) |>
  bind_cols(sims)

head(sims)
```

R squared for 1000 simulated datasets.

```{r}
# Map over the simulated columns to get 1000 simulated R squared values.
# The observed column is excluded so that it is not regressed on itself.
sim_r_sq <- sims |>
  select(-`Average Emissions`) |>
  map(~ lm(`Average Emissions` ~ .x, data = sims)) |>
  map(glance) |>
  map_dbl(~ .x$r.squared)

head(sim_r_sq)
```

Simulated R squared distribution

```{r}
# Plot the distribution of the 1000 simulated R squared values.
tibble(sims = sim_r_sq) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated" ~ R^2),
       y = "",
       subtitle = "Number of Simulated Models",
       title = "1000 Simulated R-Squared Distribution") +
  theme(plot.title = element_text(hjust = 0.5, face = 'bold'))
# The distribution of these values tells us whether the assumed model does a
# good job of producing data similar to what was observed. If it does, we
# would expect values near 1.
# The simulated R squared values are around 0.8.
```

We can see from the distribution of the 1000 simulated $R^2$ values that they occur most frequently around 0.8, which agrees with the $R^2$ from the single simulated data set.

# References

1. <https://www.gapminder.org/data/>
2. <https://plotly.com/r/line-and-scatter/#custom-color-scales>
3. de Paula Ferreira, W., Armellini, F., & De Santa-Eulalia, L. A. (2020). Simulation in industry 4.0: A state-of-the-art review. Computers & Industrial Engineering, 149, 106868. <https://doi.org/10.1016/J.CIE.2020.106868>